Added socket id env for multi GPU configuration #16
base: main
Conversation
```yaml
volumeMounts:
  - name: device-plugin-socket
    mountPath: /var/lib/kubelet/device-plugins
resources:
```
Did I miss resources moving somewhere else, or was this just not used?
```diff
 containers:
   - name: nvshare-scheduler
-    image: docker.io/grgalex/nvshare:nvshare-scheduler-v0.1-f654c296
+    image: nvshare:nvshare-scheduler-a4301f8e
```
Is this local dev code?
```diff
 containers:
   - name: nvshare-lib
-    image: docker.io/grgalex/nvshare:libnvshare-v0.1-f654c296
+    image: nvshare:libnvshare-a4301f8e
```
Is this local dev code?
```diff
 mount = &pluginapi.Mount{
-	HostPath:      SocketHostPath,
 	ContainerPath: SocketContainerPath,
+	HostPath:      SocketHostPathNew,
```
I understand you want this to stand out especially on your test machine. Do we want this on the main branch?
```go
	serverSock = pluginapi.DevicePluginPath + "nvshare-device-plugin.sock"
)
```
Minor nit-pick: it might not hurt to run this through a code formatter.
```diff
 return &NvshareDevicePlugin{
 	devs:   getDevices(),
-	socket: serverSock,
+	socket: serverSockNew,
```
I see why you did this. Could the name be something more meaningful? The next person might add another "New" on the end.
I suggest `serverSockWithId` or `serverSockId`.
```diff
 if len(socketId) == 0 {
 	socketId = "0"
 }
+resourceNameNew := resourceName + socketId
```
resourceNameWithId?
```yaml
imagePullPolicy: IfNotPresent
env:
  - name: NVSHARE_VIRTUAL_DEVICES
    value: "10"
```
Will this work for a 12-GPU machine? It looks like existing code that moved. Even if this were changed to 12, would it add value on a 12-GPU machine? Anyway, I entered an issue, since this looks like a separate issue from the multi-GPU support.
From reviewing the code, not live experience: hard-coded to 10 virtual GPUs?
Disclaimer: I'm familiar with C# and JavaScript and have some ancient experience with Java and C/C++; this is my first time seeing Go. I am too excited to see this in production to leave it untouched.
I love this change and look forward to using it! If someone has already tested this on a single-GPU machine, then I vote for an approval.
Minor nit-picks:
- `variableNameNew` could be more descriptive, like `variableNameWithId` or `variableNameId`
- Overall, the code is well formatted and easy to read, but there are a few things an automatic code formatter might improve.
Today I learned that by default anyone can do a code review on a PR made to a public repo. I'll take a look at this when I get a chance.

However, just dropping a PR is not the correct way of doing things. Some open-source repos do that, but usually they just end up piling shit on top of shit. Each PR must have a backing issue that it addresses, and each commit should either ref ( See issue #11 and the associated PR for a real example of the flow I'm describing.

Additionally, no commit should break the software, i.e., anyone should be able to build from source or deploy from the manifests for any given commit. In this case:

So, @AMIYAMAITY can you open an issue, describing the problem you want to address, along with a suggested plan to address it? @jake-brewer-isa Thanks for attempting to review, but it's too early for that.
Hi @grgalex
It's been a week and more. We are using the nvshare library in Kubernetes pods and managing GPU memory across processes/pods. It's working fine without any issues. Thank you for making it open for everyone; it's quite useful for GPU-based projects.

Now we are trying to integrate it with multiple GPUs. In a node we have 2 GPUs, and we are trying to integrate both of them.

Regarding multi-GPU, I have experimented with a couple of things. On the repo I went through some issues about multi-GPU configuration. Initially I increased the scheduler and device plugins in the deployment files and deployed the nvshare-system. Then we deployed our project on the node. As soon as we deployed, we got a socket connection error.

After that I made a fork and added an environment variable `NVSHARE_SOCK_ID`. It basically takes the id of the GPU, and based on that id it creates/connects to the socket. We have passed `NVSHARE_SOCK_ID` and `NVIDIA_VISIBLE_DEVICES` in a couple of places, like the `scheduler`, `device-plugins`, and the containers. This approach is working fine without any issues. In multiple containers we can pass `NVSHARE_SOCK_ID` and `NVIDIA_VISIBLE_DEVICES` based on which GPU we want to use. Below are some screenshots of the config and deployments:
Now we have another use case. Let's assume that in a container we are passing `NVSHARE_SOCK_ID: "0"` and `NVIDIA_VISIBLE_DEVICES: "0"`; it means this container will access only the 0th GPU. In one of the containers there is a single process with two threads. Each thread utilizes a GPU via `cudaSetDevice()` in C++. In normal cases we have to pass `NVIDIA_VISIBLE_DEVICES: "all"` and then we can access and map them. But along with nvshare, how will we pass and manage the socket for `NVSHARE_SOCK_ID`? Could you please go through my changes and help with the above problem?